Visual question answering model, electronic device and storage medium
Abstract:
Embodiments of the present disclosure disclose a visual question answering model, an electronic device and a storage medium. The visual question answering model includes an image encoder and a text encoder. The text encoder is configured to perform pooling on a word vector sequence of an inputted question text, so as to extract a semantic representation vector of the question text; and the image encoder is configured to extract an image feature of a given image in combination with the semantic representation vector. By processing the text vector through pooling, the embodiments according to the present disclosure ensure that the model training efficiency is effectively improved on the premise of a small loss of prediction accuracy of the visual question answering model, which makes the model convenient to use in engineering.

Publication number: EP3709207A1
Application number: EP20150895.9
Filing date: 2020-01-09
Publication date: 2020-09-16
Inventors: Jianhui Huang; Min Qiao; Pingping Huang; Yong Zhu; Yajuan Lyu; Ying Li
Applicant: Beijing Baidu Netcom Science and Technology Co Ltd
IPC main class: G06N5-00
Description:
[0001] Embodiments of the present disclosure relate to the technical field of artificial intelligence, and more particularly, to a visual question answering model, an electronic device and a storage medium.

BACKGROUND

[0002] The visual question answering (VQA) system is a typical application of multi-modality fusion. For example, for a given image in which there is a batter wearing red clothes, if a relevant question "what color shirt is the batter wearing?" is presented, the VQA model needs to combine the image information and the text question information to predict that the answer is "red". This process mainly involves semantic feature extraction on the image and the text, and fusion of the features of the two modalities, i.e., the extracted image features and text features, so that a VQA-related model mainly consists of a text encoder and an image encoder.

[0003] However, since both an image encoder and a text encoder are involved, the VQA model usually contains a large number of parameters that require training, and thus the time required for model training is quite long. Therefore, on the premise that the loss of model accuracy is not great, how to improve the training efficiency of the model by simplifying the model from the engineering point of view becomes a technical problem that needs to be solved urgently at present.

SUMMARY

[0004] Embodiments of the present disclosure provide a visual question answering model, an electronic device and a storage medium, so as to improve the training efficiency of the model by simplifying the model from the engineering point of view on the premise that the loss of model accuracy is not great.

[0005] In a first aspect, an embodiment of the present disclosure provides a visual question answering model, including an image encoder and a text encoder, in which the text encoder is configured to perform pooling on a word vector sequence of an inputted question text, so as to extract a semantic representation vector of the question text; and the image encoder is configured to extract an image feature of a given image in combination with the semantic representation vector.

[0006] In an embodiment, the text encoder is configured to perform maxPooling processing or avgPooling processing on the word vector sequence of the question text to extract the semantic representation vector of the question text.

[0007] In an embodiment, the maxPooling processing is expressed by an equation of:

$f(w_1, w_2, \ldots, w_k) = \max(w_1, w_2, \ldots, w_k), \ \mathrm{dim} = 1$

[0008] In an embodiment, the avgPooling processing is expressed by an equation of:

$p(w_1, w_2, \ldots, w_k) = \frac{\sum_{i=1}^{k} w_i}{k}$

[0009] In a second aspect, an embodiment of the present disclosure further provides an electronic device, including: one or more processors; and a storage device configured to store one or more programs, in which, when the one or more programs are executed by the one or more processors, the one or more processors are configured to operate a visual question answering model according to any one of the embodiments of the present disclosure.

[0010] In a third aspect, an embodiment of the present disclosure further provides a computer readable storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the program operates a visual question answering model according to any one of the embodiments of the present disclosure.

[0011] Embodiments of the present disclosure provide a visual question answering model, an electronic device and a storage medium.
For the visual question answering model, the text vector is encoded by pooling to simplify the visual question answering model, and the quantity of parameters that need to be trained in the visual question answering model is reduced by this simple pooling-based encoding, so that the training efficiency of the visual question answering model is effectively improved, which makes the model convenient to use in engineering.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 is a schematic diagram of a visual question answering model according to Embodiment 1 of the present disclosure. FIG. 2 is a schematic diagram of another visual question answering model according to Embodiment 2 of the present disclosure. FIG. 3 is a schematic diagram of an electronic device according to Embodiment 3 of the present disclosure.

DETAILED DESCRIPTION

[0013] The present disclosure will be described in detail below with reference to the accompanying drawings and the embodiments. It may be understood that the specific embodiments described herein are only used to explain the present disclosure rather than to limit it. In addition, it should also be noted that, for convenience of description, only the structures related to the present disclosure, rather than all structures, are illustrated in the accompanying drawings.

Embodiment 1

[0014] FIG. 1 is a schematic diagram of a visual question answering model according to this embodiment of the present disclosure. This embodiment improves the training efficiency of the visual question answering model by simplifying the visual question answering model. The model may be operated on an electronic device, such as a computer terminal or a server.

[0015] As illustrated in FIG. 1, the visual question answering model according to the embodiment of the present disclosure may include a text encoder configured to perform pooling on a word vector sequence of an inputted question text, so as to extract a semantic representation vector of the question text.

[0016] Before the question text is encoded, the question text needs to be preprocessed. Illustratively, the question text is processed with a word2vec model or a GloVe model to obtain the word vector sequence corresponding to the question text. To encode the question text, the word vector sequence corresponding to the question text may be input into the text encoder, and the text encoder then performs pooling on the word vector sequence to extract the semantic representation vector of the question text. It should be noted that, in the prior art, an LSTM (long short-term memory) model or a Bi-LSTM (bi-directional long short-term memory) model is configured as the text encoder. In the present disclosure, the pooling replaces the LSTM model or the Bi-LSTM model as the text encoder, and thus the visual question answering model is simplified.

[0017] In this embodiment, the pooling refers to maxPooling processing, which is expressed by an equation of:

$f(w_1, w_2, \ldots, w_k) = \max(w_1, w_2, \ldots, w_k), \ \mathrm{dim} = 1$

[0018] Illustratively, a word vector sequence of a question text is $w_1 = (0.1, 0.2, 0.3)$, $w_2 = (0.2, 0.1, -0.1)$, $w_3 = (0.3, 0.4, 0.2)$.

[0019] In addition, an image encoder in the visual question answering model according to the embodiment of the present disclosure is configured to extract an image feature of a given image in combination with the semantic representation vector.
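For illustration only, the following is a minimal sketch in Python/NumPy of the pooling-based text encoder described above. It assumes the question text has already been converted into a word vector sequence (for example, by a word2vec or GloVe lookup) stacked into a k x d matrix with one row per word, and it assumes that the maxPooling and avgPooling equations are taken element-wise over the word axis of the sequence; that reading of the dim parameter is an assumption of this sketch, not stated in the original text.

```python
import numpy as np

def max_pooling_encoder(word_vectors: np.ndarray) -> np.ndarray:
    """Semantic representation by element-wise maximum over the word axis.

    word_vectors: (k, d) matrix, one row per word of the question text.
    Returns a (d,) vector, analogous to f(w1, ..., wk) above.
    """
    return word_vectors.max(axis=0)

def avg_pooling_encoder(word_vectors: np.ndarray) -> np.ndarray:
    """Semantic representation by averaging the k word vectors,
    analogous to p(w1, ..., wk) = (sum_i w_i) / k."""
    return word_vectors.mean(axis=0)

# Example with the three word vectors used illustratively above.
W = np.array([[0.1, 0.2, 0.3],
              [0.2, 0.1, -0.1],
              [0.3, 0.4, 0.2]])
print(max_pooling_encoder(W))  # element-wise maximum over the three word vectors
print(avg_pooling_encoder(W))  # element-wise average over the three word vectors
```

Because pooling has no trainable parameters, such an encoder contributes nothing to the parameter count of the model, which is the source of the training-time savings reported in the experiments below.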
[0020] Since an image contains a background and rich content, in order to ensure that the machine pays more attention to the image content related to the question, thereby improving the accuracy of the answer, a visual attention mechanism (Attention in FIG. 1) may be used. With the attention mechanism, the image encoder may, according to the semantic representation vector corresponding to the question text obtained from the text encoder, lock onto the image content having the highest relevance to the semantic representation vector, and may extract the image feature of that image content, so as to obtain an image feature vector. The image encoder may adopt a convolutional neural network model, such as a Faster R-CNN model.

[0021] Further, as illustrated in FIG. 1, the visual question answering model includes a feature fusion module for fusing features of different modalities. In this embodiment, the feature fusion module is configured to fuse the image feature vector output by the image encoder and the semantic representation vector output by the text encoder. Illustratively, the image feature vector and the semantic representation vector may be fused by means of a dot product.

[0022] The visual question answering model further includes a classifier that numerically processes the vector output by the feature fusion module with a softmax function (also referred to as a normalized exponential function), so as to obtain the relative probabilities of different answers, and to output the answer corresponding to the maximum relative probability.
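Continuing the NumPy sketch above, the following shows one plausible way the attention, fusion and softmax classifier stages could fit together. The specific attention formulation (a softmax over question-region similarity scores), the region-feature matrix and the classifier weight matrix W_cls are illustrative assumptions, and the "dot product" fusion is read here as an element-wise product so that the fused result remains a vector; the disclosure itself only specifies that attention locks onto the image content relevant to the question, that fusion may be a dot product, and that the classifier uses a softmax.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(region_features: np.ndarray, question_vec: np.ndarray) -> np.ndarray:
    """Weight image region features by their relevance to the question.

    region_features: (n_regions, d) features, e.g. from a Faster R-CNN backbone.
    question_vec:    (d,) semantic representation vector of the question.
    Returns a (d,) attended image feature vector.
    """
    scores = region_features @ question_vec   # relevance score per region
    weights = softmax(scores)                 # attention distribution over regions
    return weights @ region_features          # weighted sum of region features

def predict_answer_probs(image_vec, question_vec, W_cls):
    """Element-wise fusion followed by a linear layer and a softmax.
    W_cls is a hypothetical (n_answers, d) classifier weight matrix."""
    fused = image_vec * question_vec          # fuse the two modalities
    return softmax(W_cls @ fused)             # relative probability per answer

# Toy usage with random stand-in tensors (d = 3 to match the pooling example).
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 3))             # 5 candidate image regions
q = np.array([0.3, 0.4, 0.3])                 # a pooled question vector
W_cls = rng.normal(size=(10, 3))              # 10 candidate answers
probs = predict_answer_probs(attend(regions, q), q, W_cls)
print(int(probs.argmax()))                    # index of the predicted answer
```

In a trained model, the classifier weights (and any attention projection layers) would be learned; in this simplified pipeline the text encoder itself remains parameter-free.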
[0023] For the above-mentioned visual question answering model, in a specific implementation, the Visual Genome data set released by the Stanford Artificial Intelligence Laboratory is used as the training sample data and the verification data. In addition, the training sample data and the verification data may be randomly divided at a ratio of 2:1, so as to train and verify the visual question answering model. Specific statistics of the data set are shown in Table 1. Each image contains a certain number of questions, and the given answers are manually annotated.

Table 1
Name                   Number
Number of images       108,077
Number of questions    1,445,322

[0024] The visual question answering model according to this embodiment is trained and verified on the above data. Specifically, the visual question answering model may be run on a P40 cluster, and the environment configuration of the P40 cluster and the basic parameters of the model are shown in Table 2. For comparison, visual question answering models using LSTM and Bi-LSTM respectively as the text encoders in the prior art are trained and verified simultaneously. The results are shown in Table 3.

Table 2
Name             Configuration    Additional
System           CentOS 6.0
Type of GPU      P40              The memory space of the graphics card is 24 GB.
Number of GPUs   4 cards
Batch_size       512
Epochs           12,000           Epochs are counted in mini-batches.

[0025] It may be seen from the verification results listed in Table 3 that, compared with the existing visual question answering models using LSTM or Bi-LSTM as the text encoder, the visual question answering model using the maxPooling processing as the text encoder according to the embodiment of the present disclosure has a loss of merely 0.5% in prediction accuracy while shortening the running time of the model by up to 3 hours, so that the training efficiency is greatly improved.

Table 3
Text Encoder    Running Time    Prediction Accuracy
LSTM            7.5 h           41.39%
Bi-LSTM         8.2 h           41.36%
maxPooling      5.2 h           40.84%

[0026] According to the embodiment of the present disclosure, for the visual question answering model, the text vector is encoded by pooling to simplify the visual question answering model, and through this simple pooling-based encoding, the training efficiency of the visual question answering model is effectively improved on the premise of a small loss of prediction accuracy of the visual question answering model, which makes the model convenient to use in engineering.

Embodiment 2

[0027] FIG. 2 is a schematic diagram of another visual question answering model according to this embodiment of the present disclosure. As shown in FIG. 2, the visual question answering model includes the text encoder, wherein the text encoder is configured to perform pooling on the word vector sequence of the inputted question text, so as to extract the semantic representation vector of the question text.

[0028] The pooling refers to avgPooling processing, which may be expressed by an equation of:

$p(w_1, w_2, \ldots, w_k) = \frac{\sum_{i=1}^{k} w_i}{k}$

[0029] Illustratively, a word vector sequence of a question text is $w_1 = (0.1, 0.2, 0.3)$, $w_2 = (0.2, 0.1, -0.1)$, $w_3 = (0.3, 0.4, 0.2)$.

[0030] In addition, the image encoder in the visual question answering model according to the embodiment of the present disclosure is configured to extract the image feature of the given image in combination with the semantic representation vector.

[0031] Further, the visual question answering model further includes the feature fusion module and the classifier. Reference may be made to the above embodiment for the feature fusion module and the classifier, and repeated description is omitted herein.

[0032] The visual question answering model according to this embodiment is trained and verified on the aforementioned P40 cluster with the aforementioned Visual Genome data set. In addition, visual question answering models using LSTM and Bi-LSTM respectively as the text encoders in the prior art are trained and verified simultaneously. The results are shown in Table 4.

Table 4
Text Encoder    Running Time    Prediction Accuracy
LSTM            7.5 h           41.39%
Bi-LSTM         8.2 h           41.36%
avgPooling      5.8 h           40.96%

[0033] It may be seen from Table 4 that, compared with the existing visual question answering models using LSTM or Bi-LSTM as the text encoder, the visual question answering model using the avgPooling processing as the text encoder according to the embodiment of the present disclosure has a loss of merely 0.4% in prediction accuracy while shortening the running time of the model by up to 2.4 hours, so that the training efficiency is improved.

[0034] According to the embodiment of the present disclosure, for the visual question answering model, the text vector is encoded by the avgPooling processing to simplify the visual question answering model, and through this simple encoding manner, the training efficiency of the visual question answering model is effectively improved on the premise of a small loss of prediction accuracy of the visual question answering model, which makes the model convenient to use in engineering.

Embodiment 3

[0035] FIG. 3 is a schematic diagram of an electronic device according to this embodiment of the present disclosure, namely a block diagram of an electronic device 12 for implementing embodiments of the present disclosure.
The electronic device 12 illustrated in FIG. 3 is only an example, and should not be considered as any restriction on the function and the usage range of embodiments of the present disclosure.

[0036] As illustrated in FIG. 3, the electronic device 12 is represented in the form of a general-purpose computing apparatus. The electronic device 12 may include, but is not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processor 16).

[0037] The bus 18 represents one or more of several types of bus architectures, including a memory bus or a memory control bus, a peripheral bus, a graphic acceleration port (GAP) bus, a processor bus, or a local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, an industry standard architecture (ISA) bus, a micro-channel architecture (MCA) bus, an enhanced ISA bus, a video electronics standards association (VESA) local bus, and a peripheral component interconnect (PCI) bus.

[0038] Typically, the electronic device 12 may include multiple kinds of computer-readable media. These media may be any storage media accessible by the electronic device 12, including transitory or non-transitory storage media and movable or unmovable storage media.

[0039] The memory 28 may include a computer-readable medium in the form of volatile memory, such as a random access memory (RAM) 30 and/or a high-speed cache memory 32. The electronic device 12 may further include other transitory/non-transitory storage media and movable/unmovable storage media. By way of example only, the storage system 34 may be configured to read and write non-removable, non-volatile magnetic media (not shown in the figure, commonly referred to as "hard disk drives"). Although not illustrated in FIG. 3, a disk driver for reading and writing movable non-volatile magnetic disks (e.g., "floppy disks") may be provided, as well as an optical driver for reading and writing movable non-volatile optical disks (e.g., a compact disc read only memory (CD-ROM), a digital video disc read only memory (DVD-ROM), or other optical media). In these cases, each driver may be connected to the bus 18 via one or more data medium interfaces. The memory 28 may include at least one program product, which has a set of (for example, at least one) program modules configured to perform the functions of embodiments of the present disclosure.

[0040] A program/application 40 with a set of (at least one) program modules 42 may be stored in the memory 28. The program modules 42 may include, but are not limited to, an operating system, one or more application programs, other program modules and program data, and any one or combination of the above examples may include an implementation in a network environment. The program modules 42 are generally configured to implement the functions and/or methods described in embodiments of the present disclosure.

[0041] The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. This kind of communication can be achieved by the input/output (I/O) interface 22.
In addition, the electronic device 12 may be connected to and communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet, through a network adapter 20. As shown in FIG. 3, the network adapter 20 communicates with the other modules of the electronic device 12 over the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in combination with the electronic device 12, including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems and the like.

[0042] The processor 16 can perform various functional applications and data processing by running programs stored in the system memory 28, for example, to run the visual question answering model according to embodiments of the present disclosure. The visual question answering model includes an image encoder and a text encoder, in which the text encoder is configured to perform pooling on a word vector sequence of an inputted question text, so as to extract a semantic representation vector of the question text; and the image encoder is configured to extract an image feature of a given image in combination with the semantic representation vector.

Embodiment 4

[0043] Embodiment 4 of the present disclosure provides a storage medium, namely a computer readable storage medium. The storage medium stores the visual question answering model according to the embodiments of the present disclosure, which is run by a computer processor. The visual question answering model includes an image encoder and a text encoder, wherein the text encoder is configured to perform pooling on a word vector sequence of an inputted question text, so as to extract a semantic representation vector of the question text; and the image encoder is configured to extract an image feature of a given image in combination with the semantic representation vector.

[0044] Certainly, the computer readable storage medium according to the embodiment of the present disclosure may also execute a visual question answering model according to any embodiment of the present disclosure.

[0045] The computer storage medium may adopt any combination of one or more computer readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device or component, or any combination thereof. Specific examples of the computer readable storage medium include (a non-exhaustive list): an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read only memory (ROM), an erasable programmable read only memory (EPROM) or a flash memory, an optical fiber, a compact disc read-only memory (CD-ROM), an optical memory component, a magnetic memory component, or any suitable combination thereof. In this context, the computer readable storage medium may be any tangible medium including or storing programs. The programs may be used by an instruction execution system, apparatus or device, or in connection therewith.

[0046] The computer readable signal medium may include a data signal propagating in baseband or as part of a carrier wave, which carries computer readable program codes.
Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. The computer readable signal medium may also be any computer readable medium other than the computer readable storage medium, which may send, propagate, or transport programs used by an instruction execution system, apparatus or device, or in connection therewith.

[0047] The program code stored on the computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination thereof.

[0048] The computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages. The programming languages include object oriented programming languages, such as Java, Smalltalk and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (hereinafter referred to as LAN) or a wide area network (hereinafter referred to as WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

[0049] It should be noted that the above are only preferred embodiments and applied technical principles of the present disclosure. Those skilled in the art should understand that the present disclosure is not limited to the specific embodiments described herein, and various obvious changes, readjustments and substitutions made by those skilled in the art will not depart from the scope of the present disclosure. Therefore, although the present disclosure has been described in detail through the above embodiments, the present disclosure is not limited to the above embodiments, and more equivalent embodiments may be included without departing from the concept of the present disclosure. The scope of the present disclosure is determined by the scope of the appended claims.
Claims (6)

1. A visual question answering model, comprising an image encoder and a text encoder, wherein the text encoder is configured to perform pooling on a word vector sequence of an inputted question text, so as to extract a semantic representation vector of the question text; and the image encoder is configured to extract an image feature of a given image in combination with the semantic representation vector.

2. The model according to claim 1, wherein the text encoder is configured to perform maxPooling processing or avgPooling processing on the word vector sequence of the question text to extract the semantic representation vector of the question text.

3. The model according to claim 2, wherein the maxPooling processing is expressed by an equation of: $f(w_1, w_2, \ldots, w_k) = \max(w_1, w_2, \ldots, w_k), \ \mathrm{dim} = 1$

4. The model according to claim 2, wherein the avgPooling processing is expressed by an equation of: $p(w_1, w_2, \ldots, w_k) = \frac{\sum_{i=1}^{k} w_i}{k}$

5. An electronic device, comprising: one or more processors; and a storage device configured to store one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors are configured to operate a visual question answering model according to any one of claims 1 to 4.

6. A computer readable storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the program operates a visual question answering model according to any one of claims 1 to 4.
Family patents:

Publication number    Publication date
KR20200110154A        2020-09-23
US20200293921A1       2020-09-17
CN109902166A          2019-06-18
JP2020149685A         2020-09-17